Conversation
aaron-kaplan
left a comment
There was a problem hiding this comment.
The overall approach of label the data, group by labels, aggregate over groups feels good to me.
I'm not so confident about dayofyear366, because of the confusion around leap days. I've thought about that problem enough to know that thinking about it more probably isn't going to help, one just has to pick one of the several imperfect solutions available. I'm curious why you felt you needed to implement it now. Is isn't relevant to the onset/cessation work as far as I can see.
enacts/calc.py
Outdated
There was a problem hiding this comment.
I think you're assuming that the T coordinate is dense, i.e. contains every day between its first value and its last. What if it's missing days? Specifically, what if it's missing the season start or end day in some years? I think we should just construct the edges arrays from scratch rather than extracting them from T with sel_day_and_month.
There was a problem hiding this comment.
I pushed a proposed solution to this. Have a look.
There was a problem hiding this comment.
I think you've fixed the problem. Not sure it's the simplest solution--what if we just generated a dense T coordinate, created start and end edges based on that, and then selected from the dense coordinate using the indices of the sparse one? Implementation detail and thus of secondary concern.
There was a problem hiding this comment.
just pushed something else.
There was a problem hiding this comment.
The implementation still seems unnecessarily complicated, but that's something to work together on interactively rather than via PR comments, and doesn't need to block progress. What's most important is the interface and the tests.
Regarding the interface, either assign_season_coords should be renamed select_season (select_and_label_season?), or it should only label seasons and not drop data outside of the season (set season start and end to NaT outside the season).
Regarding tests, here are some properties of the function that could be tested:
- Behavior when start of the data is before the season, during the season, after the season. Likewise at the end of the data.
- Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass.
- Confirm that season ends are all the right number of days after season starts, i.e. that we haven't accidentally created seasons that span multiple years.
There was a problem hiding this comment.
I think all that was mentioned here has been addressed? except:
"Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass."
You're saying that all my tests are confirming that what I retained is within their appropriate seasons, but I am not testing that I did retain all that I should have?
There was a problem hiding this comment.
Yes, that seems to be what I was saying.
There was a problem hiding this comment.
I can't wrap my head around what the test(s) could be...
There was a problem hiding this comment.
Construct the expected T coordinate, assert that the T coordinate returned by the function is identical to that?
There was a problem hiding this comment.
I pushed something to that effect
I did it then mostly because I had an epiphany and wanted to... battre le fer tant qu'il est chaud. I do want to make a daily climatology for the water balance work so there is a little bit of motivation from there. If we validate that route, it will impact also the onset/cessation work to compute them for multiple years. This is now done by our seasonal_ functions that rely on dailytobegroupedby (that you didn't like) that this work is heavily inspired from. It really is the same thing, only written (API?) differently. |
|
@aaron-kaplan I am going to continue our conversation here. First, I can split the seasonal work from the climato/ano work since they don't share code and since the seasonal thing is more pressing now. I realize that there is an additional difficulty with the water balance application of seasonal grouping. Which is that water balance accepts multiple inputs and return multiple outputs. It turns out that there is a Dataset.groupby but so that means I also need to make water balance functions accept its inputs as one Dataset, and returns another Dataset. So that I can apply the labelling-dropping function to it, then the grouping function (groupby), then the operator on groups... |
|
Adding Xandre more systematically now on this "core" work... even if only for FYI. This talks about the seasonal case only now. I moved the clim/ano part into another PR. |
|
Revisited this based on Tuesday conversation. Need to work on doc, adding in particular examples of usage. Tests are now real tests. Want to add test for sparse data |
aaron-kaplan
left a comment
There was a problem hiding this comment.
My questions about what the function does are important; questions about implementation details are secondary and don't necessarily need to be resolved before we can move forward.
I think this retains partial seasons at the start and end. Is that the right (most useful) behavior? I think I've observed that Ingrid drops partial seasons. But probably in the seasonalAverage function, which does more than this function does.
enacts/calc.py
Outdated
There was a problem hiding this comment.
Seasonal analysis is relevant to monthly and dekadal data too. Should those have their own functions, or should we leave room to generalize this one?
There was a problem hiding this comment.
called it time_series. Indeed as I look a the code, it doesn't look like it would care that the data is dekadal or monthly. It's not like xarray understands so much about things being daily anyways.
We might want to get to the bottom of this then and make start_/end_day optional because it would be awkward to ask for them in the "monthly" case
|
@aaron-kaplan can we move this on? |
| end_month = 3 | ||
| day_offset = -1 | ||
| else: | ||
| day_offset = 0 |
There was a problem hiding this comment.
Maybe end_month/end_day should always represent the first day that's not included in the season? I.e. day_offset is always -1, then the DJF season would be represented with end_month = 3, end_month = 1, and JFM with end_month = 4, end_day = 1. Advantages: no special case --> less likelihood of a bug in the special case that we don't notice because we failed to test it; corresponds to e.g. how arrays are sliced in python: '012345'[0:4] == '0123', not '01234'`. Disadvantage: for some reason xarray slicing, unlike python slicing, uses intervals that are closed on both ends, so that could be confusing.
There was a problem hiding this comment.
It's late in the review cycle for me to be introducing something like this. Feel free to ignore if you don't like the suggestion.
There was a problem hiding this comment.
I can see arguments for (or against) either way.
I am not sure there won't be any special case. If the user was a season ending in 28 Feb (included), would they put 29 Feb or 1 Mar? If they put 29 Feb, they would miss 3 28 Feb out of 4; if they put 1 Mar, they would include 29 Feb every 4 years. At least if using sel_day_and_month...
If we envision to function to work also for the monhtly case, then start/end_day should be an optional input and having its default value at 1 would be straightforward with open interval end. If we keep the intervals closed on both ends, we should probably set them to None, have start be 1, and have end be 1 too with an offset of -1 so that when users don't enter days, they get full months. So it's more gymnastics (but not crazy) in that case.
Then, as we try to build on top of xarray (well at least that is what I am trying to do), I feel we should try and opt xarray conventions, and thus keep closed intervals... That's not a very strong argument.
There was a problem hiding this comment.
If they put 29 Feb, they would miss 3 28 Feb out of 4
I don't understand. Why would they miss Feb 28?
Xarray's choice to define slices as closed on both ends doesn't make sense to me. You can't partition a range of real values into bins using intervals that are closed on both ends. Indeed, groupby_bins defaults to intervals that are open on the left and closed on the right, which is doubly weird, because it's consistent with neither the behavior of .sel nor the more general convention, which (as I see it) is for intervals to be closed on the left and open on the right.
There was a problem hiding this comment.
They would miss it if we keep on using sel_day_and_month
Yes I noticed the seemingly strange behavior of groupby_bins with respect to intervals edges inclusivity. They just reproduced the pandas.cut behavior though. And maybe it is so because this is more about categorizing rather than slicing? Maybe we should not try, use and think through the groupby prism but through the resample one?
Namely to do "seasonal" analysis, ie to apply same analysis to a slice of daily data within each year and to do "daily climatology" and back from clim to real time to, for instance, compute anomalies.
There are 2 functions and 2 tests. Tests aren't really tests but rather how I envision the usage (or what would be inside functions easier for the user to call). seasonal_groups creates the groups for seasonal analysis while dayofyear366 creates the groups for climatologies.
The first test function shows one can apply one after the other different functions to seasons only (of course no more seasonal analysis can be done after a reducing function is applied -- and also, one doesn't have to apply a reducing function, the end result is then a normal daily variable, only with data points only at days of seasons). Here seasonal_groups ensures that we keep only complete seasons, but I think it should keep also truncated seasons at extremities and let user slice accordingly if they don't want incomplete seasons. I labeled the groups with the first day of the season, but we should also keep the last day. Or keep the "groups" attribute some how that is a dictionary of lists showing all the points associated with each label.
The climato/ano case... I haven't much to comment. Should we be given a 365-day climato, it would be easy to project it to a 366-day one such as the one I created and fill Feb 29 with something and have the user have a few options about what to fill with.
Also with this set up, we need not create functions for each seasonal (e.g. seasonal_sum, seasonal_onset) or climato/ano analysis (e.g. clim: mean, stddev...; ano: subtract, divide...), but can describe the process in documentation, or make functions calling functions (this time allowing users to see them):
seasonal(daily_data, start_day, start_month, end_day, end_month, [func1, func2...])
climato(daily_data, [func1, func2...])
ano(daily_data, [func1, func2...])
(also, this is not necessarily meant to merge: this is playground.... I thought there was a label or something for that but now I can't see it)